Efficient and Effective Filtering of Duplication Detection in Large Database Applications

Author

  • Ji Zhang
Abstract

In this paper, a robust filtering technique called PC-Filter (PC stands for partition comparison) is proposed for effective and efficient duplicate record detection in large databases. PC-Filter differs from existing methods in that it uses record partitions for duplicate detection. It operates in three steps. First, it sorts the whole database and splits the sorted database into a number of record partitions. Second, it generates the Partition Comparison Graph (PCG) by performing fast partition pruning. Finally, it detects duplicate records through internal and external partition comparison based on the PCG. Four closure properties, derived from the triangle inequality of record similarity, are used as heuristics to make the filter efficient. The partition size is chosen so that the time complexity of PC-Filter is optimized. Equipping existing detection methods with PC-Filter addresses the major problems from which those methods suffer.
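
The three steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not state the four closure properties, the similarity measure, or how the partition size is chosen, so everything specific here is an assumption. The sketch assumes Levenshtein edit distance as the record distance (a metric, so the triangle inequality holds), uses the first record of each partition as a hypothetical anchor, and prunes a pair of partitions when the triangle-inequality lower bound on any cross-partition distance exceeds the duplicate threshold. The names `edit_distance`, `find_duplicates`, `partition_size`, and `threshold` are illustrative.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; it is a metric, so the triangle inequality holds."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def find_duplicates(records, partition_size, threshold):
    """Return (duplicate index pairs, surviving partition pairs)."""
    # Step 1: sort the records and split the sorted order into partitions.
    order = sorted(range(len(records)), key=lambda i: records[i])
    parts = [order[k:k + partition_size]
             for k in range(0, len(order), partition_size)]

    # Assumed summary of each partition: an anchor record (its first member)
    # and a radius (largest distance from the anchor to any member).
    anchors = [p[0] for p in parts]
    radii = [max(edit_distance(records[p[0]], records[i]) for i in p)
             for p in parts]

    # Step 2: build the comparison graph by pruning partition pairs.
    # For x in S and y in T: d(x, y) >= d(anchor_S, anchor_T) - r_S - r_T.
    # If that lower bound exceeds the threshold, no cross pair can match.
    pcg = []
    for s, t in combinations(range(len(parts)), 2):
        gap = edit_distance(records[anchors[s]], records[anchors[t]])
        if gap - radii[s] - radii[t] <= threshold:
            pcg.append((s, t))

    # Step 3: internal comparison within each partition, then external
    # comparison only across partition pairs that survived pruning.
    pairs = set()
    for p in parts:
        for i, j in combinations(p, 2):
            if edit_distance(records[i], records[j]) <= threshold:
                pairs.add((min(i, j), max(i, j)))
    for s, t in pcg:
        for i in parts[s]:
            for j in parts[t]:
                if edit_distance(records[i], records[j]) <= threshold:
                    pairs.add((min(i, j), max(i, j)))
    return pairs, pcg


records = ["john smith", "jon smith", "mary jones", "mary jonse",
           "zzzzzzzzzzzzzzzzzzzz"]
pairs, pcg = find_duplicates(records, partition_size=2, threshold=2)
print(sorted(pairs), pcg)
```

Because the pruning rule only discards partition pairs whose lower bound exceeds the threshold, this sketch returns the same pairs as an exhaustive all-pairs comparison while skipping the cross-partition comparisons that the bound rules out.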




Journal title:
  • JSW

Volume 7

Publication date: 2012